Exercise pattern classification

author: jaketuricchi

This data was collected from activity tracking device(s) (unspecified). There is a training data set with a known class ranging
From A-E, which should be predicted in the test data set. Many of the features are not defined, limiting the domain-specific
preprocessing and EDA we can do.

The data can be found at: https://www.kaggle.com/athniv/exercisepatternpredict

Set up

Import, set, read, initial exploration

Import packages

In [1]:
import math
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import warnings 
import seaborn as sns
import sklearn
from datetime import datetime
import calendar
%matplotlib inline
pd.set_option('display.max_rows', 1000)

Set wd and read in data

In [2]:
os.chdir(r"C:/Users/jaket/Dropbox/Kaggle/Exercise_pattern_prediction")
In [3]:
train_df = pd.read_csv('pml-training.csv', error_bad_lines=False, index_col=False).drop('Unnamed: 0', axis=1)
test_df = pd.read_csv('pml-testing.csv', error_bad_lines=False, index_col=False).drop('Unnamed: 0', axis=1)
D:\Users\jaket\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (11,14,19,22,25,70,73,86,87,89,90,94,97,100) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Randomise the rows, it is currently very structured and improve training predictability

In [4]:
train_df = train_df.sample(frac=1).reset_index(drop=True)

Explore

In [5]:
print(train_df.columns.values)
print(train_df.isna().sum()) 
['user_name' 'raw_timestamp_part_1' 'raw_timestamp_part_2'
 'cvtd_timestamp' 'new_window' 'num_window' 'roll_belt' 'pitch_belt'
 'yaw_belt' 'total_accel_belt' 'kurtosis_roll_belt' 'kurtosis_picth_belt'
 'kurtosis_yaw_belt' 'skewness_roll_belt' 'skewness_roll_belt.1'
 'skewness_yaw_belt' 'max_roll_belt' 'max_picth_belt' 'max_yaw_belt'
 'min_roll_belt' 'min_pitch_belt' 'min_yaw_belt' 'amplitude_roll_belt'
 'amplitude_pitch_belt' 'amplitude_yaw_belt' 'var_total_accel_belt'
 'avg_roll_belt' 'stddev_roll_belt' 'var_roll_belt' 'avg_pitch_belt'
 'stddev_pitch_belt' 'var_pitch_belt' 'avg_yaw_belt' 'stddev_yaw_belt'
 'var_yaw_belt' 'gyros_belt_x' 'gyros_belt_y' 'gyros_belt_z'
 'accel_belt_x' 'accel_belt_y' 'accel_belt_z' 'magnet_belt_x'
 'magnet_belt_y' 'magnet_belt_z' 'roll_arm' 'pitch_arm' 'yaw_arm'
 'total_accel_arm' 'var_accel_arm' 'avg_roll_arm' 'stddev_roll_arm'
 'var_roll_arm' 'avg_pitch_arm' 'stddev_pitch_arm' 'var_pitch_arm'
 'avg_yaw_arm' 'stddev_yaw_arm' 'var_yaw_arm' 'gyros_arm_x' 'gyros_arm_y'
 'gyros_arm_z' 'accel_arm_x' 'accel_arm_y' 'accel_arm_z' 'magnet_arm_x'
 'magnet_arm_y' 'magnet_arm_z' 'kurtosis_roll_arm' 'kurtosis_picth_arm'
 'kurtosis_yaw_arm' 'skewness_roll_arm' 'skewness_pitch_arm'
 'skewness_yaw_arm' 'max_roll_arm' 'max_picth_arm' 'max_yaw_arm'
 'min_roll_arm' 'min_pitch_arm' 'min_yaw_arm' 'amplitude_roll_arm'
 'amplitude_pitch_arm' 'amplitude_yaw_arm' 'roll_dumbbell'
 'pitch_dumbbell' 'yaw_dumbbell' 'kurtosis_roll_dumbbell'
 'kurtosis_picth_dumbbell' 'kurtosis_yaw_dumbbell'
 'skewness_roll_dumbbell' 'skewness_pitch_dumbbell'
 'skewness_yaw_dumbbell' 'max_roll_dumbbell' 'max_picth_dumbbell'
 'max_yaw_dumbbell' 'min_roll_dumbbell' 'min_pitch_dumbbell'
 'min_yaw_dumbbell' 'amplitude_roll_dumbbell' 'amplitude_pitch_dumbbell'
 'amplitude_yaw_dumbbell' 'total_accel_dumbbell' 'var_accel_dumbbell'
 'avg_roll_dumbbell' 'stddev_roll_dumbbell' 'var_roll_dumbbell'
 'avg_pitch_dumbbell' 'stddev_pitch_dumbbell' 'var_pitch_dumbbell'
 'avg_yaw_dumbbell' 'stddev_yaw_dumbbell' 'var_yaw_dumbbell'
 'gyros_dumbbell_x' 'gyros_dumbbell_y' 'gyros_dumbbell_z'
 'accel_dumbbell_x' 'accel_dumbbell_y' 'accel_dumbbell_z'
 'magnet_dumbbell_x' 'magnet_dumbbell_y' 'magnet_dumbbell_z'
 'roll_forearm' 'pitch_forearm' 'yaw_forearm' 'kurtosis_roll_forearm'
 'kurtosis_picth_forearm' 'kurtosis_yaw_forearm' 'skewness_roll_forearm'
 'skewness_pitch_forearm' 'skewness_yaw_forearm' 'max_roll_forearm'
 'max_picth_forearm' 'max_yaw_forearm' 'min_roll_forearm'
 'min_pitch_forearm' 'min_yaw_forearm' 'amplitude_roll_forearm'
 'amplitude_pitch_forearm' 'amplitude_yaw_forearm' 'total_accel_forearm'
 'var_accel_forearm' 'avg_roll_forearm' 'stddev_roll_forearm'
 'var_roll_forearm' 'avg_pitch_forearm' 'stddev_pitch_forearm'
 'var_pitch_forearm' 'avg_yaw_forearm' 'stddev_yaw_forearm'
 'var_yaw_forearm' 'gyros_forearm_x' 'gyros_forearm_y' 'gyros_forearm_z'
 'accel_forearm_x' 'accel_forearm_y' 'accel_forearm_z' 'magnet_forearm_x'
 'magnet_forearm_y' 'magnet_forearm_z' 'classe']
user_name                       0
raw_timestamp_part_1            0
raw_timestamp_part_2            0
cvtd_timestamp                  0
new_window                      0
num_window                      0
roll_belt                       0
pitch_belt                      0
yaw_belt                        0
total_accel_belt                0
kurtosis_roll_belt          19216
kurtosis_picth_belt         19216
kurtosis_yaw_belt           19216
skewness_roll_belt          19216
skewness_roll_belt.1        19216
skewness_yaw_belt           19216
max_roll_belt               19216
max_picth_belt              19216
max_yaw_belt                19216
min_roll_belt               19216
min_pitch_belt              19216
min_yaw_belt                19216
amplitude_roll_belt         19216
amplitude_pitch_belt        19216
amplitude_yaw_belt          19216
var_total_accel_belt        19216
avg_roll_belt               19216
stddev_roll_belt            19216
var_roll_belt               19216
avg_pitch_belt              19216
stddev_pitch_belt           19216
var_pitch_belt              19216
avg_yaw_belt                19216
stddev_yaw_belt             19216
var_yaw_belt                19216
gyros_belt_x                    0
gyros_belt_y                    0
gyros_belt_z                    0
accel_belt_x                    0
accel_belt_y                    0
accel_belt_z                    0
magnet_belt_x                   0
magnet_belt_y                   0
magnet_belt_z                   0
roll_arm                        0
pitch_arm                       0
yaw_arm                         0
total_accel_arm                 0
var_accel_arm               19216
avg_roll_arm                19216
stddev_roll_arm             19216
var_roll_arm                19216
avg_pitch_arm               19216
stddev_pitch_arm            19216
var_pitch_arm               19216
avg_yaw_arm                 19216
stddev_yaw_arm              19216
var_yaw_arm                 19216
gyros_arm_x                     0
gyros_arm_y                     0
gyros_arm_z                     0
accel_arm_x                     0
accel_arm_y                     0
accel_arm_z                     0
magnet_arm_x                    0
magnet_arm_y                    0
magnet_arm_z                    0
kurtosis_roll_arm           19216
kurtosis_picth_arm          19216
kurtosis_yaw_arm            19216
skewness_roll_arm           19216
skewness_pitch_arm          19216
skewness_yaw_arm            19216
max_roll_arm                19216
max_picth_arm               19216
max_yaw_arm                 19216
min_roll_arm                19216
min_pitch_arm               19216
min_yaw_arm                 19216
amplitude_roll_arm          19216
amplitude_pitch_arm         19216
amplitude_yaw_arm           19216
roll_dumbbell                   0
pitch_dumbbell                  0
yaw_dumbbell                    0
kurtosis_roll_dumbbell      19216
kurtosis_picth_dumbbell     19216
kurtosis_yaw_dumbbell       19216
skewness_roll_dumbbell      19216
skewness_pitch_dumbbell     19216
skewness_yaw_dumbbell       19216
max_roll_dumbbell           19216
max_picth_dumbbell          19216
max_yaw_dumbbell            19216
min_roll_dumbbell           19216
min_pitch_dumbbell          19216
min_yaw_dumbbell            19216
amplitude_roll_dumbbell     19216
amplitude_pitch_dumbbell    19216
amplitude_yaw_dumbbell      19216
total_accel_dumbbell            0
var_accel_dumbbell          19216
avg_roll_dumbbell           19216
stddev_roll_dumbbell        19216
var_roll_dumbbell           19216
avg_pitch_dumbbell          19216
stddev_pitch_dumbbell       19216
var_pitch_dumbbell          19216
avg_yaw_dumbbell            19216
stddev_yaw_dumbbell         19216
var_yaw_dumbbell            19216
gyros_dumbbell_x                0
gyros_dumbbell_y                0
gyros_dumbbell_z                0
accel_dumbbell_x                0
accel_dumbbell_y                0
accel_dumbbell_z                0
magnet_dumbbell_x               0
magnet_dumbbell_y               0
magnet_dumbbell_z               0
roll_forearm                    0
pitch_forearm                   0
yaw_forearm                     0
kurtosis_roll_forearm       19216
kurtosis_picth_forearm      19216
kurtosis_yaw_forearm        19216
skewness_roll_forearm       19216
skewness_pitch_forearm      19216
skewness_yaw_forearm        19216
max_roll_forearm            19216
max_picth_forearm           19216
max_yaw_forearm             19216
min_roll_forearm            19216
min_pitch_forearm           19216
min_yaw_forearm             19216
amplitude_roll_forearm      19216
amplitude_pitch_forearm     19216
amplitude_yaw_forearm       19216
total_accel_forearm             0
var_accel_forearm           19216
avg_roll_forearm            19216
stddev_roll_forearm         19216
var_roll_forearm            19216
avg_pitch_forearm           19216
stddev_pitch_forearm        19216
var_pitch_forearm           19216
avg_yaw_forearm             19216
stddev_yaw_forearm          19216
var_yaw_forearm             19216
gyros_forearm_x                 0
gyros_forearm_y                 0
gyros_forearm_z                 0
accel_forearm_x                 0
accel_forearm_y                 0
accel_forearm_z                 0
magnet_forearm_x                0
magnet_forearm_y                0
magnet_forearm_z                0
classe                          0
dtype: int64
In [6]:
sns.heatmap(train_df.isnull(), cbar=False) # Heatmap to visualise NAs
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f6cc4d4248>

It's clear we have a large amount of missing data in some columns. It's likely that where this data is available is specific to
a certain exercise type/class. Due to the amount, imputation will not be effective. In all other columns there is no missing data
It could be appropriate to create a seperate df in the cases which have this additional data.

In [7]:
train_df.describe()
Out[7]:
raw_timestamp_part_1 raw_timestamp_part_2 num_window roll_belt pitch_belt yaw_belt total_accel_belt max_roll_belt max_picth_belt min_roll_belt ... var_yaw_forearm gyros_forearm_x gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
count 1.962200e+04 19622.000000 19622.000000 19622.000000 19622.000000 19622.000000 19622.000000 406.000000 406.000000 406.000000 ... 406.000000 19622.000000 19622.000000 19622.000000 19622.000000 19622.000000 19622.000000 19622.000000 19622.000000 19622.000000
mean 1.322827e+09 500656.144277 430.640047 64.407197 0.305283 -11.205061 11.312608 -6.667241 12.923645 -10.436453 ... 4639.849068 0.157951 0.075175 0.151245 -61.651819 163.655896 -55.291917 -312.575884 380.116445 393.613745
std 2.049277e+05 288222.879958 247.909554 62.750255 22.351242 95.193926 7.742309 94.594252 8.005960 93.616774 ... 7284.972361 0.648618 3.100725 1.754483 180.593687 200.130082 138.396947 346.958482 509.373742 369.268747
min 1.322490e+09 294.000000 1.000000 -28.900000 -55.800000 -180.000000 0.000000 -94.300000 3.000000 -180.000000 ... 0.000000 -22.000000 -7.020000 -8.090000 -498.000000 -632.000000 -446.000000 -1280.000000 -896.000000 -973.000000
25% 1.322673e+09 252912.250000 222.000000 1.100000 1.760000 -88.300000 3.000000 -88.000000 5.000000 -88.400000 ... 0.274550 -0.220000 -1.460000 -0.180000 -178.000000 57.000000 -182.000000 -616.000000 2.000000 191.000000
50% 1.322833e+09 496380.000000 424.000000 113.000000 5.280000 -13.000000 17.000000 -5.100000 18.000000 -7.850000 ... 612.214225 0.050000 0.030000 0.080000 -57.000000 201.000000 -39.000000 -378.000000 591.000000 511.000000
75% 1.323084e+09 751890.750000 644.000000 123.000000 14.900000 12.900000 18.000000 18.500000 19.000000 9.050000 ... 7368.414252 0.560000 1.620000 0.490000 76.000000 312.000000 26.000000 -73.000000 737.000000 653.000000
max 1.323095e+09 998801.000000 864.000000 162.000000 60.300000 179.000000 29.000000 180.000000 30.000000 173.000000 ... 39009.333330 3.970000 311.000000 231.000000 477.000000 923.000000 291.000000 672.000000 1480.000000 1090.000000

8 rows × 122 columns

EDA

First, it would make sense to run some pairplots using class as the Hue, so we can begin to determine which variables are related
To the target. We can only do this in the training data set. Lets start by examining the frequency of classes and participants

In [8]:
train_df['classe']=train_df['classe'].astype('category')
In [9]:
freq_plot1=train_df.filter(items=['user_name', 'classe'])
freq_plot1=freq_plot1.groupby(['user_name'])['classe'].agg(counts='value_counts').reset_index()
In [10]:
sns.barplot(data = freq_plot1, x = 'counts', y = 'user_name', hue = 'classe', ci = None)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f6cc1cca48>

Its clear that the frequency of class A is much greater than each other class. The classes are not divided equally between users.
For example, Jeremy has around 2x as many A's as other classes. This is important as user may be an important predictive
Variable when predicting the test data, and we will later need to OH Encode user.

Now lets plot some pairplots...

In [11]:
            
pairplot1=train_df.filter(items=['num_window', 'roll_belt', 'pitch_belt', 'yaw_belt', 'total_accel_belt', 'classe'])
sns.pairplot(pairplot1, hue='classe',  plot_kws = {'alpha': 0.6, 'edgecolor': 'k'},size = 4)
D:\Users\jaket\anaconda3\lib\site-packages\seaborn\axisgrid.py:2079: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[11]:
<seaborn.axisgrid.PairGrid at 0x1f6cca6b488>

There aren't any large differences in these relationships regarding class. However, the distribution does seem to be slightly different
Across different classes. Lets look at the X and Y axes of the gyros, accel and magnet belt.

In [12]:
pairplot2=train_df.filter(items=['num_window', 'gyros_belt_x', 'gyros_belt_y', 'accel_belt_x', 'accel_belt_y',  'magnet_belt_x','magnet_belt_y', 'classe'])
sns.pairplot(pairplot2, hue='classe',  plot_kws = {'alpha': 0.6,  'edgecolor': 'k'},size = 4)
D:\Users\jaket\anaconda3\lib\site-packages\seaborn\axisgrid.py:2079: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[12]:
<seaborn.axisgrid.PairGrid at 0x1f6cf00ec48>

Again the data is closely grouped, though we can see that movement D has some determinable features not shared by the others on
These axes. Can the Z-axes provide any more information?

In [13]:
pairplot3=train_df.filter(items=['num_window', 'gyros_belt_z', 'accel_belt_z', 'magnet_belt_z', 'classe'])
sns.pairplot(pairplot3, hue='classe',  plot_kws = {'alpha': 0.6, 'edgecolor': 'k'},size = 4)
D:\Users\jaket\anaconda3\lib\site-packages\seaborn\axisgrid.py:2079: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[13]:
<seaborn.axisgrid.PairGrid at 0x1f6d7b551c8>

Now we see some extremely distinctive features of class D on the Z axis. This means these variables will be important in classification.

In [14]:
pairplot4=train_df.filter(items=['roll_arm', 'pitch_arm', 'yaw_arm', 'total_accel_arm', 'classe'])
sns.pairplot(pairplot4, hue='classe',  plot_kws = {'alpha': 0.6, 'edgecolor': 'k'},size = 4)
D:\Users\jaket\anaconda3\lib\site-packages\seaborn\axisgrid.py:2079: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[14]:
<seaborn.axisgrid.PairGrid at 0x1f6df907848>

The arm data shows significantly different patterns from the belt data. However, there is not a considerable difference between classes.
Again, lets look at them on the x/y and then the Z axis.

In [15]:
pairplot5=train_df.filter(items=['num_window', 'gyros_arm_x', 'gyros_arm_y', 'accel_arm_x', 'accel_arm_y',  'magnet_arm_x','magnet_arm_y', 'classe'])
sns.pairplot(pairplot5, hue='classe',  plot_kws = {'alpha': 0.6,  'edgecolor': 'k'},size = 4)
D:\Users\jaket\anaconda3\lib\site-packages\seaborn\axisgrid.py:2079: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[15]:
<seaborn.axisgrid.PairGrid at 0x1f6e32f0bc8>

And on the Z:

In [16]:
pairplot6=train_df.filter(items=['num_window', 'gyros_arm_z', 'accel_arm_z', 'magnet_arm_z', 'classe'])
sns.pairplot(pairplot6, hue='classe',  plot_kws = {'alpha': 0.6, 'edgecolor': 'k'},size = 4)
D:\Users\jaket\anaconda3\lib\site-packages\seaborn\axisgrid.py:2079: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[16]:
<seaborn.axisgrid.PairGrid at 0x1f6e8ec7e88>

The differences here are again subtle. The distribution of A seems to take a distinctly different pattern from the others.

We can do similar plots for the forearm variables if we wanted but i'll skip it for now. Lastly, lets take a look at some of the
Variables where nearly all data is missing. If this is uninformative, it makes sense for us to remove it, however it may give
away a certain class clearly. Since there are many of these I shall just pick a few out.

In [17]:
pairplot7=train_df.filter(items=['skewness_roll_belt', 'max_roll_belt', 'max_picth_belt', 
                                 'var_total_accel_belt', 'stdev_roll_belt',
                                 'avg_yaw_belt', 'classe'])
sns.pairplot(pairplot7, hue='classe',  plot_kws = {'alpha': 0.6, 'edgecolor': 'k'},size = 4)
D:\Users\jaket\anaconda3\lib\site-packages\seaborn\axisgrid.py:2079: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x1f6ec2027c8>

These dont show any clear associations with class. Given the small (<1%) fraction of the data available, it makes sense that we remove these.

Preprocessing

Before we move to modelling we should consider feature removal, feature engineering and categorical encoding amongst other things.

Feature Removal

First, drop cols with high % NA

In [18]:
print(train_df.isna().sum()) 
train_df = train_df.loc[:, train_df.isnull().mean() < .8] #remove cols with <80% completeness.
test_df = test_df.loc[:, test_df.isnull().mean() < .8] #remove cols with <80% completeness.
user_name                       0
raw_timestamp_part_1            0
raw_timestamp_part_2            0
cvtd_timestamp                  0
new_window                      0
num_window                      0
roll_belt                       0
pitch_belt                      0
yaw_belt                        0
total_accel_belt                0
kurtosis_roll_belt          19216
kurtosis_picth_belt         19216
kurtosis_yaw_belt           19216
skewness_roll_belt          19216
skewness_roll_belt.1        19216
skewness_yaw_belt           19216
max_roll_belt               19216
max_picth_belt              19216
max_yaw_belt                19216
min_roll_belt               19216
min_pitch_belt              19216
min_yaw_belt                19216
amplitude_roll_belt         19216
amplitude_pitch_belt        19216
amplitude_yaw_belt          19216
var_total_accel_belt        19216
avg_roll_belt               19216
stddev_roll_belt            19216
var_roll_belt               19216
avg_pitch_belt              19216
stddev_pitch_belt           19216
var_pitch_belt              19216
avg_yaw_belt                19216
stddev_yaw_belt             19216
var_yaw_belt                19216
gyros_belt_x                    0
gyros_belt_y                    0
gyros_belt_z                    0
accel_belt_x                    0
accel_belt_y                    0
accel_belt_z                    0
magnet_belt_x                   0
magnet_belt_y                   0
magnet_belt_z                   0
roll_arm                        0
pitch_arm                       0
yaw_arm                         0
total_accel_arm                 0
var_accel_arm               19216
avg_roll_arm                19216
stddev_roll_arm             19216
var_roll_arm                19216
avg_pitch_arm               19216
stddev_pitch_arm            19216
var_pitch_arm               19216
avg_yaw_arm                 19216
stddev_yaw_arm              19216
var_yaw_arm                 19216
gyros_arm_x                     0
gyros_arm_y                     0
gyros_arm_z                     0
accel_arm_x                     0
accel_arm_y                     0
accel_arm_z                     0
magnet_arm_x                    0
magnet_arm_y                    0
magnet_arm_z                    0
kurtosis_roll_arm           19216
kurtosis_picth_arm          19216
kurtosis_yaw_arm            19216
skewness_roll_arm           19216
skewness_pitch_arm          19216
skewness_yaw_arm            19216
max_roll_arm                19216
max_picth_arm               19216
max_yaw_arm                 19216
min_roll_arm                19216
min_pitch_arm               19216
min_yaw_arm                 19216
amplitude_roll_arm          19216
amplitude_pitch_arm         19216
amplitude_yaw_arm           19216
roll_dumbbell                   0
pitch_dumbbell                  0
yaw_dumbbell                    0
kurtosis_roll_dumbbell      19216
kurtosis_picth_dumbbell     19216
kurtosis_yaw_dumbbell       19216
skewness_roll_dumbbell      19216
skewness_pitch_dumbbell     19216
skewness_yaw_dumbbell       19216
max_roll_dumbbell           19216
max_picth_dumbbell          19216
max_yaw_dumbbell            19216
min_roll_dumbbell           19216
min_pitch_dumbbell          19216
min_yaw_dumbbell            19216
amplitude_roll_dumbbell     19216
amplitude_pitch_dumbbell    19216
amplitude_yaw_dumbbell      19216
total_accel_dumbbell            0
var_accel_dumbbell          19216
avg_roll_dumbbell           19216
stddev_roll_dumbbell        19216
var_roll_dumbbell           19216
avg_pitch_dumbbell          19216
stddev_pitch_dumbbell       19216
var_pitch_dumbbell          19216
avg_yaw_dumbbell            19216
stddev_yaw_dumbbell         19216
var_yaw_dumbbell            19216
gyros_dumbbell_x                0
gyros_dumbbell_y                0
gyros_dumbbell_z                0
accel_dumbbell_x                0
accel_dumbbell_y                0
accel_dumbbell_z                0
magnet_dumbbell_x               0
magnet_dumbbell_y               0
magnet_dumbbell_z               0
roll_forearm                    0
pitch_forearm                   0
yaw_forearm                     0
kurtosis_roll_forearm       19216
kurtosis_picth_forearm      19216
kurtosis_yaw_forearm        19216
skewness_roll_forearm       19216
skewness_pitch_forearm      19216
skewness_yaw_forearm        19216
max_roll_forearm            19216
max_picth_forearm           19216
max_yaw_forearm             19216
min_roll_forearm            19216
min_pitch_forearm           19216
min_yaw_forearm             19216
amplitude_roll_forearm      19216
amplitude_pitch_forearm     19216
amplitude_yaw_forearm       19216
total_accel_forearm             0
var_accel_forearm           19216
avg_roll_forearm            19216
stddev_roll_forearm         19216
var_roll_forearm            19216
avg_pitch_forearm           19216
stddev_pitch_forearm        19216
var_pitch_forearm           19216
avg_yaw_forearm             19216
stddev_yaw_forearm          19216
var_yaw_forearm             19216
gyros_forearm_x                 0
gyros_forearm_y                 0
gyros_forearm_z                 0
accel_forearm_x                 0
accel_forearm_y                 0
accel_forearm_z                 0
magnet_forearm_x                0
magnet_forearm_y                0
magnet_forearm_z                0
classe                          0
dtype: int64

The timestamps are not going to be useful for predicting on the other data sets. Also, they would not be informative in real-life
activity prediction. It make be that correctly used these timestampts could give away the entire answer if they align with the
Test data, which will be the case if the test data is a random sample of train. We'll lose these for now. to make it more realistic.
For the same reasons, it makes sense to also lose num window and new window

In [19]:
train_df = train_df.drop(['raw_timestamp_part_1', 'raw_timestamp_part_2' ,'cvtd_timestamp', 'new_window','num_window'], axis=1)
test_df = test_df.drop(['raw_timestamp_part_1', 'raw_timestamp_part_2' ,'cvtd_timestamp', 'new_window','num_window', 'problem_id'], axis=1)

Feature Engineering

As we dont have an excessive number of features, nor a considerable amount of categorical variables to expand our X-cols, we
Can consider generating some interaction features. For example, lets combine all x y and z

Generate a fn to turn 0s into 1s as it not ruin the interaction variables

In [20]:
def zeros_to_ones(x):
    x = np.where(x==0, 1, x)
    return(x)

Np.prod will give us the product (multiple) of all columns for a given row, creating an interaction variable on the axis.

In [21]:
def feat_eng (df):
    df['x_axis_feat']=df[df.columns[df.columns.to_series().str.contains('_x')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    df['y_axis_feat']=df[df.columns[df.columns.to_series().str.contains('_y')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    df['z_axis_feat']=df[df.columns[df.columns.to_series().str.contains('_z')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    
    # Lets interact all belt, arm, dumbell and forearm variables
    
    df['belt_feat']=df[df.columns[df.columns.to_series().str.contains('_belt')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    df['arm_feat']=df[df.columns[df.columns.to_series().str.contains('_arm')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    df['forearm_feat']=df[df.columns[df.columns.to_series().str.contains('_forearm')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    
    # Let's interact all magnet, accel and gyros variables
    
    df['accel_feat']=df[df.columns[df.columns.to_series().str.contains('accel_')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    df['magnet_feat']=df[df.columns[df.columns.to_series().str.contains('magnet_')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    df['gyros_feat']=df[df.columns[df.columns.to_series().str.contains('gyros_')]].apply(zeros_to_ones).apply(np.prod, axis=1)
    
    return(df)
In [22]:
train_df=feat_eng(train_df)
test_df=feat_eng(test_df)

We could continue to generate more features by interacting newly engineered features, or in new combinations, and this may give
us some additional model performance however due to time and computation restraints we'll leave it here.

Encoding

Only 2 encoding processes need to be done. (1) to one hot encode the user and (2) to label encode the outcome.

In [23]:
def Encode_fn(df):
    users=pd.get_dummies(df['user_name']) #OneHot encode username
    df=pd.concat([df, users], axis=1).reset_index(drop=True) #Join to modelling df
    df=df.drop('user_name', axis=1) #Drop original username var
    return(df)
In [24]:
train_df=Encode_fn(train_df)
test_df=Encode_fn(test_df)

Label encode target

In [25]:
train_df['classe']=train_df['classe'].astype('category') # Ensure the target is cat
train_df['target']=train_df['classe'].cat.codes # Label encoding
train_df['target']=train_df['target'].astype('category') # Ensure the target is cat
train_df=train_df.drop('classe', axis=1)

Splitting

In [26]:
from sklearn.model_selection import train_test_split

Define features and labels

In [27]:
X=train_df.drop('target', axis=1).reset_index(drop=True)
y=train_df['target']

efine train and test

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Initial classification testing

Load packages

In [29]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis,  QuadraticDiscriminantAnalysis
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, log_loss, precision_score, recall_score, f1_score

Select classification algos

In [30]:
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis()]

Log results for performance vis

In [31]:
log_cols=["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame(columns=log_cols)

Run algo loop

In [32]:
for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    
    print("="*30)
    print(name)
    
    print('****Results****')
    train_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, train_predictions)
    
    # calculate score
    precision = precision_score(y_test, train_predictions, average = 'macro') 
    recall = recall_score(y_test, train_predictions, average = 'macro') 
    f_score = f1_score(y_test, train_predictions, average = 'macro')
    
    
    print("Precision: {:.4%}".format(precision))
    print("Recall: {:.4%}".format(recall))
    print("F-score: {:.4%}".format(recall))
    print("Accuracy: {:.4%}".format(acc))
    
    train_predictions = clf.predict_proba(X_test)
    ll = log_loss(y_test, train_predictions)
    print("Log Loss: {}".format(ll))
    
    log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
    log = log.append(log_entry)
    
print("="*30)
==============================
KNeighborsClassifier
****Results****
Precision: 21.8858%
Recall: 21.4819%
F-score: 21.4819%
Accuracy: 24.9427%
Log Loss: 17.096479427694035
==============================
SVC
****Results****
D:\Users\jaket\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Precision: 16.8266%
Recall: 20.5977%
F-score: 20.5977%
Accuracy: 28.8662%
Log Loss: 1.5740662843237314
==============================
DecisionTreeClassifier
****Results****
Precision: 95.7420%
Recall: 95.8698%
F-score: 95.8698%
Accuracy: 96.0000%
Log Loss: 1.3815510557964323
==============================
RandomForestClassifier
****Results****
Precision: 99.4834%
Recall: 99.4606%
F-score: 99.4606%
Accuracy: 99.5159%
Log Loss: 0.10794318581085104
==============================
AdaBoostClassifier
****Results****
Precision: 70.3643%
Recall: 71.2586%
F-score: 71.2586%
Accuracy: 70.7006%
Log Loss: 1.4350739471650658
==============================
GradientBoostingClassifier
****Results****
Precision: 96.5362%
Recall: 96.4782%
F-score: 96.4782%
Accuracy: 96.6879%
Log Loss: 0.192617002639186
==============================
GaussianNB
****Results****
Precision: 26.5096%
Recall: 20.1935%
F-score: 20.1935%
Accuracy: 27.9745%
D:\Users\jaket\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Log Loss: 1.8194538429241887
==============================
LinearDiscriminantAnalysis
****Results****
Precision: 73.9565%
Recall: 72.4391%
F-score: 72.4391%
Accuracy: 73.4777%
Log Loss: 0.7388361825592871
==============================
QuadraticDiscriminantAnalysis
****Results****
Precision: 39.5882%
Recall: 25.4597%
F-score: 25.4597%
Accuracy: 32.2293%
Log Loss: 14.849054242270288
==============================

Plot results of algo testing

Accuracy

In [33]:
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f6d18181c8>

Log Loss.

In [34]:
sns.set_color_codes("muted")
sns.barplot(x='Log Loss', y='Classifier', data=log, color="g")
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f6f034f308>

It is clear that random forest does an extremely good job of classifying these. Usually I would opt to tune multiple algos but based
On the accuracy of the RF i'll just do some brief tuning.

First, lets consider variable importance

In [35]:
rf = RandomForestClassifier(n_estimators=500, random_state = 42)
rf.fit(X_train, y_train);
feat_importances = pd.Series(rf.feature_importances_, index=X_train.columns)
feat_importances.nlargest(25).plot(kind='barh')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f6f04376c8>

No engineered features seem that important according to the RF, likely because theyre interactions (/derivatives). It would make sense
to go back and remove these to test the RF accuracy without engineering, but i'll leave that for now.

Parameter tuning

We'll tune the RF using a random search grid. We could use a grid search but it is computationally expensive and given that we're
On >99% accuracy and only need to predict a data set size of 20, I think we can manage without.

Run randomized search

In [36]:
from sklearn.model_selection import RandomizedSearchCV

Number of trees in random forest

In [37]:
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 20, num = 10)]

Number of features to consider at every split

In [38]:
max_features = ['auto', 'sqrt']

Maximum number of levels in tree

In [39]:
max_depth = [int(x) for x in np.linspace(10, 1000, num = 10)]
max_depth.append(None)

Minimum number of samples required to split a node

In [40]:
min_samples_split = [2, 5, 10]

Minimum number of samples required at each leaf node

In [41]:
min_samples_leaf = [2, 4, 10, 100]

Method of selecting samples for training each tree

In [42]:
bootstrap = [True, False]

Create the random grid

In [43]:
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)
{'n_estimators': [10, 11, 12, 13, 14, 15, 16, 17, 18, 20], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [2, 4, 10, 100], 'bootstrap': [True, False]}

Use the random grid to search for best hyperparameters
Random search of parameters, using 3 fold cross validation,
Search across 100 different combinations, and use all available cores

In [44]:
rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, 
                               n_iter = 100, cv = 3, verbose=2, 
                               random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   42.0s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.3min finished
Out[44]:
RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators=500,
                                                    n_jobs...
                   iid='deprecated', n_iter=100, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 120, 230, 340, 450,
                                                      560, 670, 780, 890, 1000,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [2, 4, 10, 100],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [10, 11, 12, 13, 14, 15,
                                                         16, 17, 18, 20]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring=None, verbose=2)
In [45]:
print(rf_random.best_params_)
{'n_estimators': 18, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 890, 'bootstrap': False}

Fit the tuned model

In [46]:
best_params_rf = rf_random.best_estimator_
best_params_rf.fit(X_train,y_train)
Out[46]:
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=890, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=18,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

Predict test data

In [47]:
y_pred_rf = best_params_rf.predict(X_test)

Evaluate

In [48]:
precision = precision_score(y_test, y_pred_rf, average = 'macro') 
recall = recall_score(y_test, y_pred_rf, average = 'macro') 
f_score = f1_score(y_test, y_pred_rf, average = 'macro')
    
    
print("Precision: {:.4%}".format(precision))
print("Recall: {:.4%}".format(recall))
print("F-score: {:.4%}".format(recall))
Precision: 99.5159%
Recall: 99.5043%
F-score: 99.5043%

Final Predictions

In [49]:
final_predictions = best_params_rf.predict(test_df)
In [50]:
print(final_predictions)
[1 0 1 0 0 4 3 1 0 0 1 2 1 0 4 4 0 1 1 1]

Convert Py to Notebook